Statistical Machine Translation for Twitter
Abstract
We consider the problem of translating short messages (Tweets) using Europarl as a starting point. After highlighting some of the domain differences between Europarl and Twitter, we show that for German-English translation, we can improve performance from a baseline BLEU score of 25.58 to 53.45. By far the single most important improvement is passing through unknown words (which are mainly URLs). Enforcing the length constraint on translated output turns out to be relatively simple. Since our Twitter translation involves little reordering, we conclude that the biggest challenge is lexical: dealing with unknown words, spelling mistakes, creative orthography and Twitter idioms.
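The two mechanisms named in the abstract, passing unknown words through and enforcing the length constraint, can be illustrated with a small sketch. This is not the authors' system: the word-for-word `lexicon` lookup stands in for a real decoder, the 140-character limit is assumed as the Twitter limit of the period, and choosing the first candidate that fits is only one plausible way to apply the constraint. The function names and sample data are hypothetical.

```python
MAX_TWEET_CHARS = 140  # assumption: the Twitter length limit at the time


def translate_with_passthrough(tokens, lexicon):
    """Toy word-for-word lookup (a stand-in for a real decoder).

    Tokens missing from the lexicon (URLs, hashtags, misspellings) are
    copied to the output unchanged instead of being dropped.
    """
    return [lexicon.get(tok.lower(), tok) for tok in tokens]


def pick_within_limit(candidates, limit=MAX_TWEET_CHARS):
    """Return the first candidate that fits within the limit.

    `candidates` is a non-empty list ordered best-first. If none fits,
    fall back to truncating the best candidate (an illustrative choice).
    """
    for cand in candidates:
        if len(cand) <= limit:
            return cand
    return candidates[0][:limit]


# Hypothetical German-English lexicon and tweet, for illustration only.
lexicon = {"das": "the", "wetter": "weather", "ist": "is", "gut": "good"}
tweet = "Das Wetter ist gut http://t.co/abc #wetter".split()
output = " ".join(translate_with_passthrough(tweet, lexicon))
final = pick_within_limit([output])
```

In this toy run the URL and the hashtag are not in the lexicon, so both survive untouched in `final`, which is the behaviour the abstract credits with the largest gain.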
Similar resources
A new model for persian multi-part words edition based on statistical machine translation
Multi-part words in English language are hyphenated and hyphen is used to separate different parts. Persian language consists of multi-part words as well. Based on Persian morphology, half-space character is needed to separate parts of multi-part words where in many cases people incorrectly use space character instead of half-space character. This common incorrectly use of space leads to some s...
Linguistic steganography on Twitter: hierarchical language modeling with manual interaction
This work proposes a natural language stegosystem for Twitter, modifying tweets as they are written to hide 4 bits of payload per tweet, which is a greater payload than previous systems have achieved. The system, CoverTweet, includes novel components, as well as some already developed in the literature. We believe that the task of transforming covers during embedding is equivalent to unilingual...
Unsupervised cleansing of noisy text
In this paper we look at the problem of cleansing noisy text using a statistical machine translation model. Noisy text is produced in informal communications such as Short Message Service (SMS), Twitter and chat. A typical Statistical Machine Translation system is trained on parallel text comprising noisy and clean sentences. In this paper we propose an unsupervised method for the translation o...
The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language
Machine Translation Evaluation Metrics (MTEMs) are the central core of Machine Translation (MT) engines as they are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages is still under question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...
Translating Government Agencies' Tweet Feeds: Specificities, Problems and (a few) Solutions
While the automatic translation of tweets has already been investigated in different scenarios, we are not aware of any attempt to translate tweets created by government agencies. In this study, we report the experimental results we obtained when translating 12 Twitter feeds published by agencies and organizations of the government of Canada, using a state-of-the-art Statistical Machine Translat...
Publication date: 2013